Abstract


We present a three-pronged approach to improving Statistical Machine Translation (SMT), building on recent success in the application of neural networks to SMT. First, we propose new features based on neural networks to model various non-local translation phenomena. Second, we augment the architecture of the neural network with tensor layers that capture important higher-order interaction among the network units. Third, we apply multitask learning to estimate the neural network parameters jointly. Each of our proposed methods results in significant improvements that are complementary. The overall improvement is +2.7 and +1.8 BLEU points for Arabic-English and Chinese-English translation over a state-of-the-art system that already includes neural network features.


1 Introduction 


Recent advances in applying Neural Networks to Statistical Machine Translation (SMT) have generally taken one of two approaches. They either develop neural network-based features that are used to score hypotheses generated from traditional translation grammars (Sundermeyer et al., 2014; Devlin et al., 2014; Auli et al., 2013; Le et al., 2012; Schwenk, 2012), or they implement the whole translation process as a single neural network (Bahdanau et al., 2014; Sutskever et al., 2014). The latter approach, sometimes referred to as Neural Machine Translation, attempts to overhaul SMT, while the former capitalizes on the strength of the current SMT paradigm and leverages the modeling power of neural networks to improve the scoring of hypotheses generated by phrase-based or hierarchical translation rules. This paper adopts the former approach, as n-best scores from state-of-the-art SMT systems often suggest that these systems can still be significantly improved with better features.

We build on Devlin et al. (2014), who proposed a simple yet powerful feedforward neural network model that estimates the translation probability conditioned on the target history and a large window of source word context. We take advantage of neural networks' ability to handle sparsity, and to infer useful abstract representations automatically. At the same time, we address the challenge of learning the large set of neural network parameters. In particular,

• We develop new Neural Network Features to model non-local translation phenomena related to word reordering. Large fully-lexicalized contexts are used to model these phenomena effectively, making the use of neural networks essential. All of the features are useful individually, and their combination results in significant improvements (Section 2).

• We use a Tensor Neural Network Architecture (Yu et al., 2012) to automatically learn complex pairwise interactions between the network nodes. The introduction of the tensor hidden layer results in more powerful features with lower model perplexity and significantly improved MT performance for all of the neural network features (Section 3).

• We apply Multitask Learning (MTL) (Caruana, 1997) to jointly train related neural network features by sharing parameters. This allows parameters learned for one feature to benefit the learning of the other features. This results in better trained models and achieves additional MT improvements (Section 4).

We apply the resulting Multitask Tensor Networks to the new features and to existing ones, obtaining strong experimental results over the strongest previous results of Devlin et al. (2014). We obtain improvements of +2.5 BLEU points for Arabic-English and +1.8 BLEU points for Chinese-English on the DARPA BOLT Web Forum condition. We also obtain improvements of +2.7 BLEU points for Arabic-English and +1.9 BLEU points for Chinese-English on the NIST OpenMT12 test sets over the best previously published results in (Devlin et al., 2014). Both the tensor architecture and multitask learning are general techniques that are likely to benefit other neural network features.


2 New Non-Local SMT Features 


Existing SMT features typically focus on local information in the source sentence, in the target hypothesis, or both. For example, the n-gram language model (LM) predicts the next target word by using previously generated target words as context (local on target), while the lexical translation model (LTM) predicts the translation of a source word by taking into account surrounding source words as context (local on source).

In this work, we focus on non-local translation phenomena that result from non-monotone reordering, where local context becomes non-local on the other side. We propose a new set of powerful MT features that are motivated by this simple idea. To facilitate the discussion, we categorize the features into hypothesis-enumerating features that estimate a probability for each generated target word (e.g., n-gram language model), and source-enumerating features that estimate a probability for each source word (e.g., lexical translation).

More concretely, we introduce a) Joint Model with Offset Source Context (JMO), a hypothesis-enumerating feature that predicts the next target word conditioned on the source context affiliated with the previous target words; and b) Translation Context Model (TCM), a source-enumerating feature that predicts the context of the translation of a source word rather than the translation itself. These two models extend pre-existing features: the Joint (language and translation) Model (JM) of Devlin et al. (2014) and the LTM, respectively. We use a large lexicalized context for these features, making the choice of implementing them as neural networks essential. We also present neural-network implementations of pre-existing source-enumerating features: lexical translation, orientation and fertility models. We obtain additional gains from using tensor networks and multitask learning in the modeling and training of all the features.


2.1 Hypothesis-Enumerating Features 

As mentioned, hypothesis-enumerating features score each word in the hypothesis, typically by conditioning it on a context of n-1 previous target words as in the n-gram language model. One recent such model, the joint model of Devlin et al. (2014), achieves large improvements over state-of-the-art SMT by using a large context window of 11 source words and 3 target words. The Joint Model with Offset Source Context (JMO) is an extension of the JM that uses the source words affiliated with the n-gram target history as context. The source contexts of JM and JMO overlap highly when the translation is monotone, but are complementary when the translation requires word reordering.

2.1.1 Joint Model with Offset Source Context 

Formally, JMO estimates the probability of the target hypothesis E conditioned on the source sentence F and a target-to-source affiliation A:

$$P(E \mid F, A) \approx \prod_{i=1}^{|E|} P\left(e_i \mid e_{i-n+1}^{i-1},\; C_{a_{i-k}} = f_{a_{i-k}-m}^{a_{i-k}+m}\right)$$

where $e_i$ is the word being predicted; $e_{i-n+1}^{i-1}$ is the string of $n-1$ previously generated words; and $C_{a_{i-k}}$ refers to the source context of $m$ source words around $f_{a_{i-k}}$, the source word affiliated with $e_{i-k}$. We refer to $k$ as the offset parameter. We use the definition of word affiliation introduced in Devlin et al. (2014). When no source context is used, the model is equivalent to an n-gram language model, while an offset parameter of k = 0 reduces the model to the JM of Devlin et al. (2014).

When k > 0, the JMO captures non-local context in the prediction of the next target word. More specifically, $e_{i-k}$ and $e_i$, which are local on the target side, are affiliated with $f_{a_{i-k}}$ and $f_{a_i}$, which may be distant from each other on the source side due to non-monotone translation, even for k = 1. The offset model captures reordering constraints by encouraging the predicted target word $e_i$ to fit well with the previous affiliated source word $f_{a_{i-k}}$ and its surrounding words. We implement a separate feature for each value of k, and later train them jointly via multitask learning. As our experiments in Section 5.2.1 confirm, the history-affiliated source context results in stronger SMT improvement than just increasing the number of surrounding words in JM.

Fig. 1 illustrates the difference between JMO and JM. Assuming n = 3 and m = 1, JM estimates $P(e_5 \mid e_4, e_3, C_{a_5} = \{f_6, f_7, f_8\})$. On the other hand, for k = 1, $\text{JMO}_{k=1}$ estimates $P(e_5 \mid e_4, e_3, C_{a_4} = \{f_8, f_9, f_{10}\})$.
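The window selection in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `jmo_source_context`, the dictionary representation of the source sentence and affiliation, and the padding token are all hypothetical and do not come from the paper. The affiliation values (e_5 affiliated with f_7, e_4 with f_9) are read off the example above.

```python
def jmo_source_context(source, affiliation, i, k, m, pad="<pad>"):
    """Return the (2m+1)-word source window C_{a_{i-k}}.

    source:      dict mapping 1-based source position j -> word f_j
    affiliation: dict mapping target position i -> affiliated source position a_i
    k:           offset parameter (k = 0 gives the plain JM context)
    pad:         hypothetical token for positions outside the sentence
    """
    centre = affiliation[i - k]  # a_{i-k}
    return [source.get(j, pad) for j in range(centre - m, centre + m + 1)]


# Toy sentence f_1 .. f_10 and the two affiliations used in the example.
source = {j: f"f{j}" for j in range(1, 11)}
affiliation = {4: 9, 5: 7}

jm_context = jmo_source_context(source, affiliation, i=5, k=0, m=1)
jmo_context = jmo_source_context(source, affiliation, i=5, k=1, m=1)
print(jm_context)   # ['f6', 'f7', 'f8']
print(jmo_context)  # ['f8', 'f9', 'f10']
```

With k = 0 the window is centred on the source word affiliated with the predicted word itself, while k = 1 moves the centre to the source word affiliated with the previous target word, which is what makes the two contexts differ under reordering.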


Figure 1: Example to illustrate features. f is the source segment, e is the corresponding translation and lines refer to the alignment. We show hypothesis-enumerating features that look at $f_7$ and source-enumerating features that look at $e_5$. We surround the source words affiliated with $e_5$ and its n-gram history with a bracket, and surround the source words affiliated with the history of $e_5$ with squares.


2.2 Source-Enumerating Features

Source-enumerating features iterate over words in the source sentence, including unaligned words, and assign each a score depending on what aspect of translation they are modeling. A source-enumerating feature can be formulated as follows:

$$P(E \mid F, A) \approx \prod_{j=1}^{|F|} P\left(Y_j \mid C_j = f_{j-m}^{j+m}\right)$$

where $C_j$ is the source context (similar to the hypothesis-enumerating features above) and $Y_j$ is the label being predicted by the feature. We first describe pre-existing source-enumerating features: the lexical translation model, the orientation model and the fertility model, and then discuss a new feature: Translation Context Model (TCM), which is an extension of the lexical translation model.

2.2.1 Pre-existing Features

Lexical Translation model (LTM) estimates the probability of translating a source word $f_j$ to a target word $l(f_j) = e_{b_j}$ given a source context $C_j$, where $b_j \in B$ is the source-to-target word affiliation as defined in (Devlin et al., 2014). When $f_j$ is translated to more than one word, we arbitrarily keep the left-most one. The target word vocabulary V is extended with a NULL token to accommodate unaligned source words.

Orientation model (ORI) describes the probability of orientation of the translation of phrases surrounding a source word $f_j$ relative to its own translation. We follow (Setiawan et al., 2013) in modeling the orientation of the left and right phrases of $f_j$ with maximal orientation span (the longest neighboring phrase consistent with alignment), which we denote by $L_j$ and $R_j$ respectively. Thus, $o(f_j) = (o_{L_j}(f_j), o_{R_j}(f_j))$, where $o_{L_j}$ and $o_{R_j}$ refer to the orientation of $L_j$ and $R_j$ respectively. For unaligned $f_j$, we set $o(f_j) = o_{L_j}(R_j)$, the orientation of $R_j$ with respect to $L_j$.

Fertility model (FM) models the probability that a source word $f_j$ generates $\phi(f_j)$ words in the hypothesis. Our implemented model only distinguishes between aligned and unaligned source words (i.e., $\phi(f_j) \in \{0, 1\}$). The generalization of the model to account for multiple values of $\phi(f_j)$ is straightforward.

2.2.2 Translation Context Model

As with JMO in Section 2.1.1, we aim to capture translation phenomena that appear local on the target hypothesis but non-local on the source side. Here, we do so by extending the LTM feature to predict not only the translated word $e_{b_j}$, but also its surrounding context. Formally, we model $P(l(f_j) \mid C_j)$, where $l(f_j) = e_{b_j-d}, \ldots, e_{b_j}, \ldots, e_{b_j+d}$ is the hypothesis word window around $e_{b_j}$. In practice, we decompose TCM further into $\prod_{d'=-d}^{+d} P(e_{b_j+d'} \mid C_j)$ and implement each factor as a separate neural network-based feature. Note that TCM is equivalent to the LTM when d = 0. Because of word reordering, a given hypothesis word in $l(f_j)$ might not be affiliated with $f_j$ or even with the words in $C_j$. TCM can model non-local information in this way.

2.2.3 Combined Model

Since the feature label is undefined for unaligned source words, we make the model hierarchical, based on whether the source word is aligned or not, and thus arrive at the following formulation:

$$P(l(f_j)) \cdot P(ori(f_j)) \cdot P(\phi(f_j)) =
\begin{cases}
P(\phi(f_j) = 0) \cdot P(o_{L_j}(R_j)) \\
P(\phi(f_j) \geq 1) \cdot \prod_{d'=-d}^{+d} P(e_{b_j+d'}) \cdot P(o_{L_j}(f_j), o_{R_j}(f_j))
\end{cases}$$

We dropped the common context ($C_j$) for readability.

We reuse Fig. 1 to illustrate the source-enumerating features. Assuming d = 1, the scores associated with $f_7$ are $P(\phi(f_7) \geq 1 \mid C_7)$ for the FM; $P(e_4 \mid C_7) \cdot P(e_5 \mid C_7) \cdot P(e_6 \mid C_7)$ for the TCM; and $P(o(f_7) = (o_{L_7}(f_7) = RA, o_{R_7}(f_7) = RA))$ for the ORI (RA refers to Reverse Adjacent). $L_7$ and $R_7$ (i.e. $f_6$ and $f_8$ respectively), the longest neighboring phrases of $f_7$, are translated in reverse order and adjacent to $e_5$.
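The TCM decomposition used in this example (one factor per offset d' around the affiliated target word) can be sketched as follows. This is a minimal illustration, not the paper's implementation: `tcm_labels`, `tcm_log_prob`, the padding token and the constant toy probabilities are all hypothetical stand-ins for the separate neural network features.

```python
import math


def tcm_labels(hypothesis, b_j, d, pad="<pad>"):
    """Map each offset d' in [-d, d] to the label e_{b_j + d'}.

    hypothesis: dict mapping target position i -> word e_i
    b_j:        target position affiliated with the source word f_j
    pad:        hypothetical token for positions outside the hypothesis
    """
    return {off: hypothesis.get(b_j + off, pad) for off in range(-d, d + 1)}


def tcm_log_prob(labels, offset_models, context):
    """Sum of log P(e_{b_j+d'} | C_j) over offsets, one model per offset."""
    return sum(math.log(offset_models[off](word, context))
               for off, word in labels.items())


# f_7 is affiliated with e_5; with d = 1 the labels are e_4, e_5, e_6.
hypothesis = {i: f"e{i}" for i in range(3, 7)}
labels = tcm_labels(hypothesis, b_j=5, d=1)

# Toy stand-ins for the per-offset networks: each returns probability 0.5.
toy_models = {off: (lambda word, context: 0.5) for off in (-1, 0, 1)}
score = tcm_log_prob(labels, toy_models, context=("f6", "f7", "f8"))
print(labels)  # {-1: 'e4', 0: 'e5', 1: 'e6'}
```

Setting d = 0 leaves only the offset-0 label, which corresponds to the plain lexical translation model.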


3 Tensor Neural Networks

The second part of this work improves SMT by improving the neural network architecture. Neural networks derive their strength from their ability to learn a high-level representation of the input automatically from data. This high-level representation is typically constructed layer by layer through a weighted sum linear operation and a non-linear activation function. With sufficient training data, neural networks often achieve state-of-the-art performance on many tasks. This stands in sharp contrast to other algorithms that require tedious manual feature engineering. For the features presented in this paper, the context words are fed to the network with minimal engineering.

We further strengthen the network's ability to learn rich interactions between its units by introducing tensors in the hidden layers. The multiplicative property of the tensor bears a close resemblance to collocation of context words, which is useful in many natural language processing tasks.

In conventional feedforward neural networks, the output of hidden layer $l$ is produced by multiplying the output vector from the previous layer with a weight matrix ($W_l$) and then applying the activation function $\sigma$ to the product. Tensor Neural Networks generalize this formulation by using a tensor $U_l$ of order 3 for the weights. The output of node $k$ in layer $l$ is computed as follows:

$$h_l[k] = \sigma\left(h_{l-1} \cdot U_l[k] \cdot h_{l-1}^T\right)$$

where $U_l[k]$, the k-th slice of $U_l$, is a square matrix.

In our implementation, we follow (Yu et al., 2012; Hutchinson et al., 2013) and use a low-rank approximation of $U_l[k] = Q_l[k] \cdot R_l[k]^T$, where $Q_l[k], R_l[k] \in \mathbb{R}^{n \times r}$. The output of node $k$ becomes:

$$h_l[k] = \sigma\left(h_{l-1} \cdot Q_l[k] \cdot R_l[k]^T \cdot h_{l-1}^T\right)$$

In our experiments, we choose r = 1, and also apply the non-linear activation function $\sigma$ distributively. We arrive at the following three equations for computing the hidden layer outputs (0 < l < L):

$$v_l = \sigma(h_{l-1} \cdot Q_l)$$
$$v'_l = \sigma(h_{l-1} \cdot R_l)$$
$$h_l = v_l \otimes v'_l$$

where $h_{l-1}$ is double-projected to $v_l$ and $v'_l$, and the two projections are merged using the Hadamard element-wise product operator $\otimes$.

This formulation allows us to use the same infrastructure of the conventional neural networks by projecting the previous layer to two different spaces of the same dimensions, then multiplying them element-wise. The only component that is different from conventional feedforward neural networks is the multiplicative function, which is trivially differentiable with respect to the learnable parameters. Figure 2(b) illustrates the tensor architecture for two hidden layers.

The tensor network can learn collocation features more easily. For example, it can learn a collocation feature that is activated only if $h_{l-1}[i]$ collocates with $h_{l-1}[j]$ by setting $U_l[k][i][j]$ to some positive number. This results in SMT improvements as we describe in Section 5.
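The rank-1 tensor layer (two projections of the previous layer merged by an element-wise product) can be sketched in plain Python as below. This is a sketch under assumptions: the paper does not state which activation function is used, so `math.tanh` is an assumed choice, and the names `project` and `tensor_layer` as well as the list-of-lists matrix layout are hypothetical.

```python
import math


def project(h, weights, activation=math.tanh):
    """Compute activation(h . W) for a row vector h and an n x n_out matrix W."""
    n_out = len(weights[0])
    return [activation(sum(h[i] * weights[i][k] for i in range(len(h))))
            for k in range(n_out)]


def tensor_layer(h_prev, q, r, activation=math.tanh):
    """Rank-1 tensor layer: project h_prev twice, then multiply element-wise."""
    v = project(h_prev, q, activation)        # first projection (Q)
    v_prime = project(h_prev, r, activation)  # second projection (R)
    return [a * b for a, b in zip(v, v_prime)]


# Toy example: 2 input units, 2 output units, identity projections.
h_prev = [1.0, 0.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
h_next = tensor_layer(h_prev, identity, identity)
print(h_next)
```

Because each output unit is a product of two projections, it is large only when both projections respond strongly, which is one way to read the collocation behaviour described above. A conventional layer would stop after a single call to `project`.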

4 Multitask Learning 

The third part of this paper addresses the challenge of effectively learning a large number of neural network parameters without overfitting. The challenge is even larger for tensor networks since they practically double the number of parameters. In this section, we propose to apply Multitask Learning (MTL) to partially address this issue. We implement MTL as parameter sharing among the networks. This effectively reduces the number of parameters, and more importantly, it takes advantage of parameters learned for one feature to better learn the parameters of the other features. Another way of looking at this is that MTL facilitates regularization through learning the other tasks.

Figure 2: The network architecture for (a) a conventional feedforward neural network, (b) tensor hidden layers, and (c) multitask learning with M features that share the embedding and first hidden layers (t = 1).

MTL is suitable for SMT features as they model different but closely related aspects of the same translation process. MTL has long been used by the wider machine learning community (Caruana, 1997) and more recently for natural language processing (Collobert and Weston, 2008; Collobert et al., 2011). The application of MTL to machine translation, however, has been much more restricted, which is rather surprising since SMT features arise from the same translation task and are naturally related.

We apply MTL to the features described in Section 2. We design all the features to also share the same neural network architecture (in this case, the tensor architecture described in Section 3) and the same input, thus resulting in two large neural networks: one for the hypothesis-enumerating features and another for the source-enumerating ones. This simplifies the implementation of MTL. Using this setup, it is possible to vary the number of shared hidden layers t from 0 (only sharing the embedding layer) to L - 1 (sharing all the layers except the output). Note that in principle MTL is applicable to other sets of networks that have different architectures or even different input sets. With MTL, the training procedure is the same as that of standard neural networks.

We use the back propagation algorithm, and use as the loss function the product of the likelihoods of each feature¹:

$$\text{Loss} = \sum_{i} \sum_{j=1}^{M} \log P\left(Y_j(X_i)\right)$$

where $X_i$ is the training sample and $Y_j$ is one of the M models trained. We use the sum of log likelihoods since we assume that the features are independent.

¹ In this and in the other parts of the paper, we add the normalization regularization term described in (Devlin et al., 2014) to the loss function to avoid computing the normalization constant at model query/decoding time.

Fig. 2(c) illustrates MTL between M models sharing the input embedding layer and the first hidden layer (t = 1) compared to the separately-trained conventional feedforward neural network and tensor neural network.
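The multitask objective (a sum of log-likelihoods over training samples and over the M features, with the shared layers evaluated once per sample) can be sketched as follows. This is a toy illustration only: `multitask_log_likelihood`, the identity stand-in for the shared layers, the task names and the fixed output distributions are hypothetical, and the normalization regularization term from the footnote is omitted.

```python
import math


def multitask_log_likelihood(samples, shared_layers, task_heads):
    """Sum over samples X_i and tasks j of log P_j(Y_j(X_i)).

    samples:       list of (x, labels) pairs, labels maps task name -> gold label
    shared_layers: function x -> hidden representation shared by all tasks
    task_heads:    dict task name -> function(hidden) -> {label: probability}
    """
    total = 0.0
    for x, labels in samples:
        hidden = shared_layers(x)  # computed once, reused by every task head
        for task, gold in labels.items():
            total += math.log(task_heads[task](hidden)[gold])
    return total


# Toy setup: identity shared layers and two heads with fixed distributions.
shared_layers = lambda x: x
task_heads = {
    "LTM": lambda hidden: {"e5": 0.5, "NULL": 0.5},
    "FERT": lambda hidden: {0: 0.25, 1: 0.75},
}
samples = [
    ("context-1", {"LTM": "e5", "FERT": 1}),
    ("context-2", {"LTM": "NULL", "FERT": 0}),
]
objective = multitask_log_likelihood(samples, shared_layers, task_heads)
print(objective)
```

In a real trainer the gradient of this quantity would flow through every task head into the shared layers, so each task's training signal also updates the parameters used by the others.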


5 Experiments 

We demonstrate the impact of our work with ex¬ 
tensive MT experiments on Arabic-English and 
Chinese-English translation for the DARPA BOLT 
Web Forum and the NIST OpenMT12 conditions. 


5.1 Baseline MT System 

We run our experiments using a state-of-the-art string-to-dependency hierarchical decoder (Shen et al., 2010). The baseline we use includes a set of powerful features as follows:

• Forward and backward rule probabilities

• Contextual lexical smoothing (Devlin, 2009)

• 5-gram Kneser-Ney LM

• Dependency LM (Shen et al., 2010)

• Length distribution (Shen et al., 2010)

• Trait features (Devlin and Matsoukas, 2012)

• Factored source syntax (Huang et al., 2013)

• Discriminative sparse feature, totaling 50k features (Chiang et al., 2009)

• Neural Network Joint Model (NNJM) and Neural Network Lexical Translation Model (NNLTM) (Devlin et al., 2014)

As shown, our baseline system already includes neural network-based features. NNJM and NNLTM use two hidden layers with 500 units and use embedding of size 200 for each input.


We use the MADA-ARZ tokenizer (Habash et al., 2013) for Arabic word tokenization. For Chinese tokenization, we use a simple longest-match-first lexicon-based approach. We align the training data using GIZA++ (Och and Ney, 2003). For tuning the weights of MT features including the new features, we use iterative k-best optimization with an ExpectedBLEU objective function (Rosti et al., 2010), and decode the test sets after 5 tuning iterations. We report the lower-cased BLEU and TER scores.


5.2 BOLT Discussion Forum 

The bulk of our experiments is on the BOLT Web 
Discussion Forum domain, which uses data col¬ 
lected by the LDC. The parallel training data con¬ 
sists of all of the high-quality NIST training cor¬ 
pora, plus an additional 3 million words of trans¬ 
lated forum data. The tuning and test sets consist 
of roughly 5000 segments each, with 2 indepen¬ 
dent references for Arabic and 3 for Chinese. 


5.2.1 Effects of New Features 

We first look at the effects of the proposed features compared to the baseline system. Table 1 summarizes the primary results of the Arabic-English and Chinese-English experiments for the BOLT condition. We show the experimental results related to hypothesis-enumerating features (HypEn) in rows S2-S5, those related to source-enumerating features (SrcEn) in rows S6-S9, and the combination of the two in row S10. For all the features, we set the source context length to m = 5 (11-word window). For JM and JMO, we set the target context length to n = 4. For the offset parameter k of JMO, we use values 1 to 3. For TCM, we model one word around the translation (d = 1). Larger values of d did not result in further gains. The baseline is comparable to the best results of (Devlin et al., 2014).

In rows S3 to S5, we incrementally add a model with different offset source context, from k = 1 to k = 3. For AR-EN, adding JMOs with different offset source context consistently yields positive effects in BLEU score, while in ZH-EN, it yields positive effects in TER score. Utilizing all offset source contexts "+JMO_{k≤3}" (row S5) yields around 0.9 BLEU point improvement in AR-EN and around 0.3 BLEU in ZH-EN compared to the baseline. The JMO consistently provides better improvement compared to a larger JM context (row S2), validating our hypothesis that using offset source context captures important non-local context.

Rows S6 to S9 present the improvements that result from implementing pre-existing source-enumerating SMT features as neural networks, and highlight the contribution of our translation context model (TCM). This set of experiments is orthogonal to the HypEn experiments (rows S2-S5). Each pre-existing model has a modest positive cumulative effect on both BLEU and TER. We see this result as further confirming the current trend of casting existing SMT features as neural networks, since our baseline already contains such features. The next row presents the results of adding the translation context model, with one word surrounding the translation (d = 1). As shown, TCM yields a positive effect of around 0.5 BLEU and TER improvements in AR-EN and around 0.2 BLEU and TER improvements in ZH-EN.

Separately, the set of source-enumerating features and the set of hypothesis-enumerating features produce around 1.1 to 1.2 points BLEU gain in AR-EN and 0.3 to 0.5 points BLEU gain in ZH-EN. The combination of the two sets produces a complementary gain in addition to the gains of the individual models, as row S10 shows. The combined gain improves to 1.5 BLEU points in AR-EN and 0.7 BLEU points in ZH-EN.


System                         AR-EN           ZH-EN
                               BLEU   TER      BLEU   TER
S1:  Baseline                  43.2   45.0     30.2   58.3
S2:  S1+JM (larger context)    43.5   45.0     30.2   58.5
S3:  S1+JMO_{k=1}              43.9   44.7     30.8   57.8
S4:  S3+JMO_{k=2}              43.9   44.7     30.7   57.8
S5:  S4+JMO_{k=3}              44.4   44.5     30.5   57.5
S6:  S1+LTM                    43.5   44.7     30.3   58.0
S7:  S6+ORI                    43.7   44.6     30.4   57.8
S8:  S7+FERT                   43.8   44.7     30.5   57.8
S9:  S8+TCM                    44.3   44.2     30.7   57.5
S10: S9+JMO_{k≤3}              44.7   44.1     30.9   57.3

Table 1: MT results of various model combinations in BLEU and in TER.





























5.2.2 Effects of Tensor Network and Multitask Learning

We first analyze the impact of tensor architecture and MTL intrinsically by reporting the models' average log-likelihood on the validation sets (a subset of the test set) in Table 2. As mentioned, we group the models into HypEn (JM and JMO_{k≤3}) and SrcEn (LTM, ORI, FERT and TCM) as we perform MTL on these two groups. Likelihoods of these two groups in the previous subsection are in column "NN" (for Neural Network), which serves as a baseline. The application of the tensor architecture improves their likelihood as shown in column "Tensor" for both languages and models.



      Feat.    Independent         MTL
               NN       Tensor     t=0, L=2   t=1, L=3
AR    HypEn    -8.85    -8.54      -8.35      -
      SrcEn    -8.47    -8.32      -8.10      -8.09
ZH    HypEn    -11.48   -11.06     -10.87     -
      SrcEn    -10.77   -10.66     -10.54     -10.49

Table 2: Sum of the average log-likelihood of the models in HypEn and SrcEn. t = 0 refers to MTL that shares only the embedding layer, while t = 1 shares the first hidden layer as well. L refers to the network's depth. Higher value is better.

The likelihoods of the MTL-related experiments are in columns with the "MTL" header. We present two sets of results. In the first set (column "MTL, t=0, L=2"), we run MTL for features from column "Tensor" by sharing the embedding layer only (t = 0). This allows us to isolate the impact of MTL in the presence of Tensors. Column "MTL, t=1, L=3" corresponds to the experiment that produces the best intrinsic result, where each model uses Tensors with three hidden layers (500x500x500, L = 3) and the models share the embedding and the first hidden layers (t = 1). MTL consistently gives further intrinsic gain compared to tensors. More sharing provides an extra gain for SrcEn as shown in the last column. Note that we only experiment with different L and t for SrcEn and not for HypEn because the models in HypEn have different input sets. In our experiments, further sharing and more hidden layers resulted in no further gain. In total, we see a consistent positive effect in intrinsic evaluation from the tensor networks and multitask learning.

Moving on to MT evaluation, we summarize the experiments showing the impact of Tensors and MTL in Table 3. For MTL, we use L = 3, t = 1 since it gives the best intrinsic score. Employing tensors instead of regular neural networks gives a significant and consistent positive impact for all models and language pairs. For the system with the baseline features, we use the tensor architecture for both the joint model and the lexical translation model of Devlin et al. (2014), resulting in an improvement of around 0.7 BLEU points, and showing the wide applicability of the tensor architecture. On top of this improved baseline, we also observe an improvement of the same scale for other models (column "Tensor"), except for HypEn features in the AR-EN experiment. Moving to MTL experiments, we see improvements, especially from SrcEn features. MTL gives around 0.5 BLEU point improvement for AR-EN and around 0.4 BLEU point for ZH-EN. When we employ both HypEn and SrcEn together, MTL gives around 0.4 BLEU point in AR-EN and 0.2 BLEU point in ZH-EN. In total, our work results in an improvement of 2.5 BLEU points for AR-EN and 1.8 BLEU points for ZH-EN on top of the best results in (Devlin et al., 2014).


5.3 NIST OpenMT12 

Our NIST system is compatible with the OpenMT12 constrained track, which consists of 10M words of high-quality parallel training data for Arabic, and 25M words for Chinese. The n-gram LM is trained on 5B words of data from the English GigaWord. For test, we use the "Arabic-To-English Original Progress Test" (1378 segments) and "Chinese-to-English Original Progress Test + OpenMT12 Current Test" (2190 segments), which consist of a mix of newswire and web data. All test segments have 4 references. Our tuning set contains 5000 segments, and is a mix of the MT02-05 eval set as well as additional held-out parallel data from the training corpora.

We report the experiments for the NIST con¬ 
dition in Table [4] In particular, we investigate 
the impact of deploying our new features (column 
“Feat”) and demonstrate the effects of the ten¬ 
sor architecture (column “Tensor”) and multitask 
learning (column “MTL”). As shown the results 
are inline with the BOLT condition where we ob¬ 
serve additive improvements from adding our new 
features, applying tensor network and multitask 
learning. On Arabic-English, we see a gain of 2.7 












Feature set 

AR-EN 

ZH-EN 

NN 

Tensor 

MTL 

NN 

Tensor 

MTL 

R \: Baseline Features 

43.2 

43.9 

- 

30.2 

30.8 

- 

i? 2 - R\ + HypEn 

44.4 

44.4 

44.5 

30.5 

31.5 

31.3 

R 3 : R\ + SrcEn 

44.3 

44.9 

45.5 

30.7 

31.5 

31.9 

i? 4 : R\ + HypEn + SrcEn 

44.7 

45.3 

45.7 

30.9 

31.8 

32.0 


Table 3: Experimental results to investigate the effects of the new features, DTN and MTL. The top 
part shows the BOLT results, while the bottom part shows the NIST results. The best results for each 
conditions and each language-pair are in bold. The baselines are in italics. . 



Base. 

Feat 

Tensor 

MTL 

AR-EN 

53.7 

55.4 

55.9 

56.4 

mixed-case 

51.8 

53.1 

53.7 

54.1 

ZH-EN 

36.6 

37.8 

38.2 

38.5 

mixed-case 

34.4 

35.5 

35.9 

36.1 


Table 4: Experimental results for the NIST condi¬ 
tion. Mixed-case scores are also reported. Base¬ 
lines are in italics. 


BLEU point and on Chinese-English, we see a 1.9
BLEU point gain. We also report the mixed-case
BLEU scores for comparison with the previous best
published results: Devlin et al. (2014) report
52.8 BLEU for Arabic-English and 34.7 BLEU for
Chinese-English. Thus, our results are around
1.3-1.4 BLEU points better. Note that they use
additional rescoring features but we do not.

6 Related Work

Our work is most closely related to Devlin et al.
(2014). They use a simple feedforward neural
network to model two important MT features: a
joint language and translation model, and a
lexical translation model. They show very large
improvements on Arabic-English and
Chinese-English web forum and newswire baselines.
We improve on their work in three aspects. First,
we model more features using neural networks,
including two novel ones: a joint model with
offset source context and a translation context
model. Second, we enhance the neural network
architecture by using tensor layers, which allows
us to model richer interactions. Lastly, we
improve the performance of the individual features
by training them using multitask learning. In the
remainder of this section, we describe previous
work relating to the three aspects of our work,
namely MT modeling, neural network architecture
and model learning.

The features we propose in this paper address
the major aspects of SMT modeling that have
informed much of the research since the original
IBM models (Brown et al., 1993): lexical
translation, reordering, word fertility, and
language models. Of particular relevance to our
work are approaches that incorporate
context-sensitivity into the models (Carpuat and
Wu, 2007), formulate reordering as an orientation
prediction task (Tillman, 2004), use neural
network language models (Bengio et al., 2003;
Schwenk, 2012), and incorporate source-side
context into them (Devlin et al., 2014; Auli et
al., 2013; Le et al., 2012; Schwenk, 2012).

Approaches to incorporating source context into
a neural network model differ mainly in how they
represent the source sentence and in how long a
history they keep. In terms of representation of
the source sentence, we follow Devlin et al.
(2014) in using a window around the affiliated
source word. To name some other approaches, Auli
et al. (2013) use latent semantic analysis and
source sentence embeddings learned from the
recurrent neural network; Sundermeyer et al.
(2014) take the representation from a
bidirectional LSTM recurrent neural network; and
Kalchbrenner and Blunsom (2013) employ a
convolutional sentence model. For target context,
recent work has tried to look beyond the classical
n-gram history. Auli et al. (2013) and
Sundermeyer et al. (2014) consider an unbounded
history, at the expense of making their models
applicable only to N-best rescoring. Another
recent line of research (Bahdanau et al., 2014;
Sutskever et al., 2014) departs more radically
from conventional feature-based SMT and implements
the MT system as a single neural network. These
models use a representation of the whole input
sentence.

We use a feedforward neural network in this
work. Besides feedforward and recurrent networks,
other network architectures that have been
applied to SMT include convolutional networks
(Kalchbrenner et al., 2014) and recursive networks
(Socher et al., 2011). The simplicity of
feedforward networks works to our advantage. More
specifically, due to the absence of a feedback
loop, the feedforward architecture allows us to
treat individual decisions independently, which
makes parallelizing the training easy and
querying the network at decoding time
straightforward. The use of tensors in the hidden
layers strengthens the neural network model,
allowing us to model more complex feature
interactions such as collocation, which has long
been recognized as important information for many
NLP tasks (e.g., word sense disambiguation (Lee
and Ng, 2002)). The tensor formulation we use is
similar to that of Yu et al. (2012) and
Hutchinson et al. (2013). Tensor neural networks
have wide application in other fields, but have
only recently been applied in NLP (Socher et al.,
2013; Pei et al., 2014). To our knowledge, our
work is the first to use tensor networks in SMT.
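
To illustrate how a tensor layer can capture pairwise interactions
between hidden units, the following is a minimal sketch of a generic
bilinear tensor layer. It is not the exact formulation used in this
paper: the function names, the use of two separate projections of the
input, and the tanh nonlinearity are all assumptions made for
illustration only.

```python
import math


def affine_tanh(W, x):
    """Ordinary hidden layer: tanh(W x), with W given as a list of rows."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def tensor_layer(T, h1, h2):
    """Bilinear tensor layer (illustrative): out[k] = tanh(h1^T T[k] h2).

    Each slice T[k] is a matrix whose entry (i, j) weights the product
    h1[i] * h2[j], so pairs of units interact multiplicatively rather
    than only additively as in a plain affine layer.
    """
    out = []
    for slice_k in T:
        z = sum(
            h1[i] * slice_k[i][j] * h2[j]
            for i in range(len(h1))
            for j in range(len(h2))
        )
        out.append(math.tanh(z))
    return out


def forward(x, W1, W2, T):
    """Project the input twice (an assumption), then combine with T."""
    h1 = affine_tanh(W1, x)
    h2 = affine_tanh(W2, x)
    return tensor_layer(T, h1, h2)
```

In this sketch, a slice that is zero except at position (i, j) fires
only when units i and j are both active, which is one way to think
about modeling a collocation-like interaction.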

Our approach to multitask learning is related to
work that is often labeled joint training or
transfer learning. To name a few of these works,
Finkel and Manning (2009) successfully train named
entity recognizers and syntactic parsers jointly,
and Singh et al. (2013) train models for
coreference resolution, named entity recognition
and relation extraction jointly. Both efforts are
motivated by the minimization of cascading errors.
Our work is most closely related to Collobert and
Weston (2008) and Collobert et al. (2011), who
apply multitask learning to train neural networks
for multiple NLP models: part-of-speech tagging,
semantic role labeling, named-entity recognition
and language model variations.
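
As a rough illustration of multitask training of neural networks, the
sketch below computes a joint loss for several tasks that share one
hidden layer and differ only in their output layers. Which layers are
shared, the task names, and the choice to sum the per-task losses with
equal weight are assumptions for illustration, not a description of
the setup used in this paper or in the cited work.

```python
import math


def matvec(W, x):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def joint_loss(x, shared_W, heads, targets):
    """Sum of per-task negative log-likelihoods for one input.

    shared_W : hidden-layer weights shared by every task (assumption)
    heads    : dict mapping task name -> task-specific output weights
    targets  : dict mapping task name -> index of the correct class
    """
    h = [math.tanh(v) for v in matvec(shared_W, x)]  # shared representation
    loss = 0.0
    for task, W_out in heads.items():
        p = softmax(matvec(W_out, h))
        loss -= math.log(p[targets[task]])
    return loss
```

Because every task's loss depends on the same shared weights, a
gradient step on the joint loss updates the shared representation
using evidence from all tasks at once.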


7 Conclusion 

This paper argues that a relatively simple
feedforward neural network can still provide
significant improvement to Statistical Machine
Translation (SMT). We support this argument by
presenting a multi-pronged approach that addresses
modeling, architectural and learning aspects of
pre-existing neural network-based SMT features.
More concretely, we present a new set of neural
network-based SMT features to capture important
translation phenomena, extend the feedforward
neural network with tensor layers, and apply
multitask learning to integrate the SMT features
more tightly. Empirically, all our proposals
produce an improvement over a state-of-the-art
machine translation system for Arabic-to-English
and Chinese-to-English, under both the BOLT web
forum and NIST conditions. Building on the success
of this paper, we plan to develop other
neural-network-based features, and also to relax
the limitation of current rule extraction
heuristics by generating translations
word-by-word.